feat: support new metrics firehose api with get_usage()
#404
base: main
Conversation
#' @description Get content usage data.
#' @param from Optional `Date` or `POSIXt`; start of the time window. If a
#'   `Date`, coerced to `YYYY-MM-DDT00:00:00` in the caller's time zone.
#' @param to Optional `Date` or `POSIXt`; end of the time window. If a
#'   `Date`, coerced to `YYYY-MM-DDT23:59:59` in the caller's time zone.
I'm not sure this is doing what we expect with time zones. And actually, I'm not totally sure what is intended, so maybe we should work that out here and then adapt the code to fit? When we send timestamps to Connect with this function, do we want them transformed from the caller's local time zone to UTC before being sent, or some other behavior?
One thing to note: if I'm reading this correctly, make_timestamp() behaves differently for character input than for non-character input: a character string is not parsed and not transformed into UTC, so this function will inherit that same behavior.
Lines 16 to 28 in e8c8075
make_timestamp <- function(input) {
  if (is.character(input)) {
    # TODO: make sure this is the right timestamp format
    return(input)
  }
  # In the call to `safe_format`:
  # - The format specifier adds a literal "Z" to the end of the timestamp, which
  #   tells Connect "This is UTC".
  # - The `tz` argument tells R to produce times in the UTC time zone.
  # - The `usetz` argument says "Don't concatenate ' UTC' to the end of the string".
  safe_format(input, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC", usetz = FALSE)
}
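To make the asymmetry concrete, here is a minimal sketch. It stubs safe_format() with plain format(), which is an assumption about its behavior, and relies on the make_timestamp() definition quoted above.

# Assumption: safe_format() behaves like format() for this purpose.
safe_format <- function(x, ...) format(x, ...)

t <- as.POSIXct("2025-04-30 09:00:00", tz = "America/New_York")
make_timestamp(t)
#> "2025-04-30T13:00:00Z"  (POSIXct input: shifted to UTC, literal "Z" appended)
make_timestamp("2025-04-30T09:00:00")
#> "2025-04-30T09:00:00"   (character input: passed through, no parsing, no UTC conversion)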
@@ -101,6 +105,65 @@ parse_connectapi <- function(data) {
  ))
}

# nolint start
What linting are we escaping here?
Commented code maybe?
Yeah, commented code — it's not roxygen2 docs, just a regular comment, because it's not an exported function.
@@ -365,3 +374,86 @@ test_that("get_content only requests vanity URLs for Connect 2024.06.0 and up",
  )
})
})

test_that("get_usage() returns usage data in the expected shape", { |
Could we add a test where we pass in a character that is a formatted timestamp? (if we are intending to support that, of course)
Sure!
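Something along these lines, perhaps. This is only a sketch that pins down the pass-through behavior directly; the real test would exercise get_usage() against the mocked requests used elsewhere in this file.

test_that("a preformatted character timestamp is passed through unchanged", {
  # Character input to make_timestamp() should not be parsed or converted to UTC.
  expect_identical(
    make_timestamp("2025-04-01T00:00:00Z"),
    "2025-04-01T00:00:00Z"
  )
})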
fast_unnest_character <- function(df, col_name) {
  if (!is.character(col_name)) {
    stop("col_name must be a character vector")
  }
  if (!col_name %in% names(df)) {
    stop("col_name is not present in df")
  }

  list_col <- df[[col_name]]

  new_cols <- names(list_col[[1]])

  df2 <- df
  for (col in new_cols) {
    df2[[col]] <- vapply(
      list_col,
      function(row) {
        if (is.null(row[[col]])) {
          NA_character_
        } else {
          row[[col]]
        }
      },
      "1",
      USE.NAMES = FALSE
    )
  }

  df2[[col_name]] <- NULL
  df2
}
The data returned from the endpoint includes path and user_agent fields nested under a data field. Without special treatment these come back as a list-column, which is awkward. I initially experimented with tidyr::unnest(), but that was slow on the larger datasets returned by this endpoint, so I wrote a custom fast_unnest_character() function which runs in about 5% (!) of the time that tidyr::unnest() takes.
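For a sense of what it does, here is a tiny illustrative example (the values are made up, not from the API):

hits <- data.frame(content_guid = c("abc", "def"))
hits$data <- list(
  list(path = "/report", user_agent = "Mozilla/5.0"),
  list(path = NULL, user_agent = "curl/8.0")  # a missing path becomes NA
)

fast_unnest_character(hits, "data")
#>   content_guid    path  user_agent
#> 1          abc /report Mozilla/5.0
#> 2          def    <NA>    curl/8.0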
Thinking about this a bit more: this isn't a huge chunk of code, of course, but it is another chunk we take on maintaining if we go this route. This is another example of how having our data interchange within connectapi be all data frames means we have to worry about the performance of turning JSON-parsed list responses into data frames, and about making sure those data frames are in a natural structure for folks to use. If we instead relied on the parsed list data as our interchange and gave folks as.data.frame() methods, we could defer the (sometimes expensive) reshaping until late in the process; eking out performance gains like this would matter much less, and we could rely on more off-the-shelf tools.
I can see what you mean, and this does align with what you've been saying about other server objects.
I hear what you're saying about parsing to data frames for data interchange and I think that approach would be great to use for, say, the integrations endpoints that I just added stories for.
For the data from the hits endpoint, presenting it as anything other than a data frame goes back to feeling kinda not-R-idiomatic, as it isn't data that can… hmm…
So definitely one of the tasks, and maybe the main task, I can imagine for this data is to treat it as a data frame and filter, plot, etc. it. But another thing you might want to do is get the content item associated with a hit, and in that case you might just want to be able to pass the hit, or hit$content_guid, to content_item().
I still think we might want to keep code like this around in an as.data.frame() method: for making an unnested data frame out of nested data from the Connect API, tidyr::unnest() was 20x slower (which surprised me! it seems like a wild differential).
Open to a variety of options — let's discuss what the best approach would be to finalize and merge this PR.
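For example, an as.data.frame() method could wrap the same logic. This is only a hypothetical sketch: the class name connect_usage and the shape of the parsed list are assumptions, and it reuses the fast_unnest_character() helper above.

as.data.frame.connect_usage <- function(x, ...) {
  # Build the top-level columns from the parsed list response...
  df <- data.frame(
    content_guid = vapply(x, function(hit) hit$content_guid, ""),
    timestamp = vapply(x, function(hit) hit$timestamp, ""),
    stringsAsFactors = FALSE
  )
  # ...keep the nested `data` field as a list-column...
  df$data <- lapply(x, function(hit) hit$data)
  # ...and flatten it only when the user asks for a data frame.
  fast_unnest_character(df, "data")
}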
#'   `Date`, coerced to `YYYY-MM-DDT00:00:00` in the caller's time zone.
#' @param to Optional `Date` or `POSIXt`; end of the time window. If a
#'   `Date`, coerced to `YYYY-MM-DDT23:59:59` in the caller's time zone.
inst_content_hits = function(from = NULL, to = NULL) { |
I don't think this method name benefits from having the inst_ prefix on it.
More of a philosophical musing, but I'm also not sure it's worth packing more things into methods on the Connect R6 object. get_usage() is a pretty thin wrapper around this. We could just put this logic inside get_usage(); I'm not sure there's much benefit to separating things like this. I know this is a pattern in parts of the package, but that doesn't mean we have to keep doing it.
Agreed that the method name doesn't benefit at all from the inst_ prefix; to the reader, it isn't at all clear what it means.
I would also be completely fine with beginning to move everything into the functions; I agree that we don't get much benefit from having the logic in the R6 object. In fact, I think it adds confusion: it was hard for me to decide where different bits of logic should live, i.e. in the method or in the outer function. It's kind of arbitrary.
I guess you could argue that the goal of the methods is to enumerate the available endpoints that operate on different objects, but it feels like a pretty verbose way to do that.
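Roughly the shape I imagine would be something like this. It's only a rough sketch of the alternative being discussed, not the final implementation: the endpoint path and the use of client$GET() with a query list are assumptions here.

get_usage <- function(client, from = NULL, to = NULL) {
  query <- list()
  if (!is.null(from)) query$from <- make_timestamp(from)
  if (!is.null(to)) query$to <- make_timestamp(to)
  # Talk to the endpoint directly rather than delegating to a Connect R6 method.
  client$GET("v1/instrumentation/content/hits", query = query)
}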
#'
#' The data returned by `get_usage()` includes all content types. For Shiny
#' content, the `timestamp` indicates the *start* of the Shiny session.
#' Additional fields for Shiny and non-Shiny are available respectively from
Why are the additional fields not included in this endpoint?
#' If no date-times are provided, all usage data will be returned.
#'
#' @param client A `Connect` R6 client object.
#' @param from Optional `Date` or date-time (`POSIXct` or `POSIXlt`). Only
If it's optional, what happens if I omit it?
Why am I having so much trouble coming up with the language for this? 😂
If it's omitted, the time constraint is unbounded on the lower side.
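Maybe example-driven wording would help. For instance (assuming client is a Connect client from connect()):

get_usage(client)                              # no bounds: all usage data
get_usage(client, from = Sys.Date() - 7)       # lower bound only: the last 7 days onward
get_usage(client, to = as.Date("2025-01-01"))  # upper bound only: everything up to 2025-01-01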
#'
#' When possible, however, we recommend using `get_usage()` over
#' `get_usage_static()` or `get_usage_shiny()`, as it will be much faster for
#' large datasets.
What does "large datasets" mean here?
I mean, not large in the "big data" sense, but fetching 100,000 records on the old API requires 200 requests, which starts to lead to delays of multiple seconds, scaling linearly. Perhaps it's best to just not describe it?
"b0eaf295" | ||
), | ||
timestamp = c( | ||
parse_connect_rfc3339("2025-04-30T12:49:16.269904Z"), |
Isn't parse_connect_rfc3339 vectorized?
# Dates are converted to timestamps with the system's time zone, so for
# repeatability we're gonna set it here.

withr::local_envvar(TZ = "UTC") |
Hmm, see #406 though
Intent

Adds support for the metrics "firehose" API.

Fixes #390

Approach

Mostly straightforward and following existing patterns in connectapi, with a few strongly-opinionated choices that I hold loosely and would be open to pushback on.

- from and to parameters: if these are timestamps, they are used as-is. If dates (with no specified times) are passed, from is treated as "start of day" and to is treated as "end of day". I think this will match user expectations — imagine selecting a start and an end date in a date range picker. (See the sketch after this list.)
- The data returned from the endpoint includes path and user_agent fields nested under a data field. Without special treatment these come back as a list-column, which is awkward. I initially experimented with tidyr::unnest(), but that was slow on the larger datasets returned by this endpoint, so I wrote a custom fast_unnest_character() function which runs in about 5% (!) of the time that tidyr::unnest() takes.
- The parse_connectapi() function could see drastic performance improvements this way too, and those would directly speed up the performance of things like the megadashboard. I have been meaning to look into this for a minute (investigate parse_connectapi_typed performance, #383).
- The way connectapi was constructing NA_datetime_, combined with its vctrs-based parsing approach, was forcefully converting all timestamps to UTC regardless of other inner functions.
- The Connect class just now inched over the limit. Not worth even having it enabled at that point.
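Sketch of the intended from/to handling (illustrative only; exact internals may differ, and client is assumed to be a Connect client from connect()):

# Passing bare dates:
get_usage(client, from = as.Date("2025-04-01"), to = as.Date("2025-04-30"))
# is meant to behave like passing explicit start-of-day and end-of-day times:
get_usage(
  client,
  from = as.POSIXct("2025-04-01 00:00:00"),  # start of the first day
  to   = as.POSIXct("2025-04-30 23:59:59")   # end of the last day
)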
Checklist

- Does this change update NEWS.md (referencing the connected issue if necessary)?
- Did you run devtools::document()?